Put your name and student ID here

Name: Mohammad Jawad Nayosh

Student ID: T01242238

Summary

This project aimed to predict median house values in California districts using various features. I explored the dataset, visualized geographical data, prepared the data for machine learning, trained several models, and evaluated their performance. The final visualization highlighted the geographical distribution of median house values and the predicted values generated by my chosen model.

Python Environment

Install Python packages

In [2]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [3]:
%cd "/content/drive/MyDrive/ANLY 6110_ II/Module 4 Training Model /M4P2 EtE Machine Learning Project"
/content/drive/MyDrive/ANLY 6110_ II/Module 4 Training Model /M4P2 EtE Machine Learning Project
In [4]:
%ls
cb_2023_us_state_500k.zip                                  housing.csv
Ch2_An_end_to_end_machine_learning_project_finisher.ipynb  map/
In [5]:
%pwd
Out[5]:
'/content/drive/MyDrive/ANLY 6110_ II/Module 4 Training Model /M4P2 EtE Machine Learning Project'
In [4]:
%pip install geopandas
%pip install contextily
%pip install mapclassify
Requirement already satisfied: geopandas in /usr/local/lib/python3.11/dist-packages (1.0.1)
Collecting contextily
  Downloading contextily-1.6.2-py3-none-any.whl.metadata (2.9 kB)
Successfully installed affine-2.4.0 click-plugins-1.1.1 cligj-0.7.2 contextily-1.6.2 mercantile-1.2.1 rasterio-1.4.3
Collecting mapclassify
  Downloading mapclassify-2.8.1-py3-none-any.whl.metadata (2.8 kB)
Successfully installed mapclassify-2.8.1

Import Python packages

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
import contextily as cx
import mapclassify as mc

Load the Data

In [6]:
housing = pd.read_csv("/content/drive/MyDrive/ANLY 6110_ II/Module 4 Training Model /M4P2 EtE Machine Learning Project/housing.csv")
In [7]:
us_gdf = gpd.read_file('/content/drive/MyDrive/ANLY 6110_ II/Module 4 Training Model /M4P2 EtE Machine Learning Project/map/cb_2023_us_state_500k.shp')

Take a Quick Look at the Data Structure

In [19]:
housing
Out[19]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
... ... ... ... ... ... ... ... ... ... ...
20635 -121.09 39.48 25.0 1665.0 374.0 845.0 330.0 1.5603 78100.0 INLAND
20636 -121.21 39.49 18.0 697.0 150.0 356.0 114.0 2.5568 77100.0 INLAND
20637 -121.22 39.43 17.0 2254.0 485.0 1007.0 433.0 1.7000 92300.0 INLAND
20638 -121.32 39.43 18.0 1860.0 409.0 741.0 349.0 1.8672 84700.0 INLAND
20639 -121.24 39.37 16.0 2785.0 616.0 1387.0 530.0 2.3886 89400.0 INLAND

20640 rows × 10 columns

This data includes metrics such as the population, median income, and median housing price for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). I will call them “districts” for short. The model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.

In [10]:
housing.shape
Out[10]:
(20640, 10)
In [21]:
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

There are 20,640 instances in the dataset, which means that it is fairly small by machine learning standards, but it’s perfect to get started. We also notice that the total_bedrooms attribute has only 20,433 non-null values, meaning that 207 districts are missing this feature. We will need to take care of this later.
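The missing-value count can be confirmed directly with `isna().sum()`. A minimal sketch with a toy DataFrame (hypothetical values) showing the same pattern that applies to `housing`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data (values are hypothetical)
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Count missing entries per column; on the real data,
# housing.isna().sum() would report 207 for total_bedrooms
missing = df.isna().sum()
print(missing["total_bedrooms"])
```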

All attributes are numerical, except for ocean_proximity. Its type is object, so it could hold any kind of Python object. But since we loaded this data from a CSV file, we know that it must be a text attribute. When we looked at the top five rows, we noticed that the values in the ocean_proximity column were repetitive, which means that it is probably a categorical attribute. We can find out what categories exist and how many districts belong to each category by using the value_counts() method:

In [11]:
housing["ocean_proximity"].value_counts()
Out[11]:
count
ocean_proximity
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5

In [24]:
housing.describe()
Out[24]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000
In [ ]:
housing.columns
Out[ ]:
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')
In [8]:
new_table = housing[["population","median_income","ocean_proximity"]].copy()
new_table
Out[8]:
population median_income ocean_proximity
0 322.0 8.3252 NEAR BAY
1 2401.0 8.3014 NEAR BAY
2 496.0 7.2574 NEAR BAY
3 558.0 5.6431 NEAR BAY
4 565.0 3.8462 NEAR BAY
... ... ... ...
20635 845.0 1.5603 INLAND
20636 356.0 2.5568 INLAND
20637 1007.0 1.7000 INLAND
20638 741.0 1.8672 INLAND
20639 1387.0 2.3886 INLAND

20640 rows × 3 columns

In [9]:
c = housing["total_rooms"]  # let us have a look at some data and check the room distribution
c
Out[9]:
total_rooms
0 880.0
1 7099.0
2 1467.0
3 1274.0
4 1627.0
... ...
20635 1665.0
20636 697.0
20637 2254.0
20638 1860.0
20639 2785.0

20640 rows × 1 columns


Create a Test Set

In [10]:
c = housing["ocean_proximity"]
vc = c.value_counts()
vc
Out[10]:
count
ocean_proximity
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5

In [29]:
fig = plt.figure(figsize=(8, 6))
ax1 = plt.subplot(1, 1, 1)
plt.bar(vc.index, vc.values)
plt.show()
In [30]:
fig = plt.figure(figsize=(8, 6))
ax1 = plt.subplot(1, 1, 1)
plt.scatter(housing['housing_median_age'], housing['population'],
            c=housing['population'], s=20, cmap='jet')
plt.colorbar()
plt.xlabel('housing_median_age')
plt.ylabel('population')
plt.show()

It looks like the majority of the population lives in houses that are 0-10 years old; the older the houses, the fewer people live in them.
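A test set can be carved out with scikit-learn's train_test_split. A minimal sketch, stratifying on binned median income so the split preserves the income distribution (toy data; the bin edges follow common practice for this dataset and are an assumption here):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for housing (incomes are hypothetical)
df = pd.DataFrame({"median_income": np.linspace(0.5, 15.0, 100)})

# Bin income into categories so the split can preserve the income distribution
df["income_cat"] = pd.cut(df["median_income"],
                          bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                          labels=[1, 2, 3, 4, 5])

# 80/20 stratified split, reproducible via random_state
train_set, test_set = train_test_split(
    df, test_size=0.2, stratify=df["income_cat"], random_state=42)
print(len(train_set), len(test_set))  # 80 20
```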

To check for correlation between attributes we can use the pandas scatter_matrix() function, which plots every numerical attribute against every other numerical attribute. Since there are 9 numerical attributes, we would get 9² = 81 plots, which would not fit on a page, so we focus on a few promising attributes that seem most correlated with the median house value.

In [47]:
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
Out[47]:
array([[<Axes: xlabel='median_house_value', ylabel='median_house_value'>,
        <Axes: xlabel='median_income', ylabel='median_house_value'>,
        <Axes: xlabel='total_rooms', ylabel='median_house_value'>,
        <Axes: xlabel='housing_median_age', ylabel='median_house_value'>],
       [<Axes: xlabel='median_house_value', ylabel='median_income'>,
        <Axes: xlabel='median_income', ylabel='median_income'>,
        <Axes: xlabel='total_rooms', ylabel='median_income'>,
        <Axes: xlabel='housing_median_age', ylabel='median_income'>],
       [<Axes: xlabel='median_house_value', ylabel='total_rooms'>,
        <Axes: xlabel='median_income', ylabel='total_rooms'>,
        <Axes: xlabel='total_rooms', ylabel='total_rooms'>,
        <Axes: xlabel='housing_median_age', ylabel='total_rooms'>],
       [<Axes: xlabel='median_house_value', ylabel='housing_median_age'>,
        <Axes: xlabel='median_income', ylabel='housing_median_age'>,
        <Axes: xlabel='total_rooms', ylabel='housing_median_age'>,
        <Axes: xlabel='housing_median_age', ylabel='housing_median_age'>]],
      dtype=object)
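Numeric correlation coefficients complement the scatter matrix. A minimal sketch with DataFrame.corr() on a toy frame (values are hypothetical, chosen to mimic the dataset's strong income-value relationship; numeric_only=True is needed because ocean_proximity is text):

```python
import pandas as pd

# Toy frame (hypothetical values) illustrating corr(numeric_only=True)
df = pd.DataFrame({
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0, 89400.0],
    "median_income": [8.3252, 8.3014, 7.2574, 5.6431, 2.3886],
})

# Pearson correlation between every pair of numeric columns
corr = df.corr(numeric_only=True)
print(corr.loc["median_house_value", "median_income"])
```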

Looking at the correlation scatterplots, it seems like the most promising attribute to predict the median house value is the median income, so let us zoom in on its scatterplot.

In [48]:
housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1, grid=True)
plt.show()

This plot reveals that the correlation is indeed quite strong; we can clearly see the upward trend, and the points are not too dispersed. It also reveals a price cap, clearly visible as a horizontal line at $500,000. But the plot reveals other, less obvious straight lines as well: a horizontal line around $450,000, another around $350,000, perhaps one around $280,000, and a few more below that. We may want to try removing the corresponding districts to prevent our algorithms from learning to reproduce these data quirks.
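One way to act on that observation is to filter out the capped districts. A minimal sketch on a toy frame (the cap value of $500,001 comes from the dataset's max shown earlier):

```python
import pandas as pd

# Toy frame with a few capped prices (the cap in this dataset is $500,001)
df = pd.DataFrame({"median_house_value": [452600.0, 500001.0, 89400.0, 500001.0]})

# Drop the districts whose value sits exactly at the cap,
# so a model does not learn to reproduce this artifact
uncapped = df[df["median_house_value"] < 500001.0]
print(len(uncapped))
```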

Visualizing Geographical Data

In [12]:
# Next, let us show our scatter plot on the map
us_gdf.head(5)  # this is our map's data; let us change the index
Out[12]:
STATEFP STATENS GEOIDFQ GEOID STUSPS NAME LSAD ALAND AWATER geometry
STUSPS
NM 35 00897535 0400000US35 35 NM New Mexico 00 314198587197 726463919 POLYGON ((-109.05017 31.48, -109.04984 31.4995...
SD 46 01785534 0400000US46 46 SD South Dakota 00 196341525171 3387709166 POLYGON ((-104.05788 44.9976, -104.05078 44.99...
CA 06 01779778 0400000US06 06 CA California 00 403673296401 20291770234 MULTIPOLYGON (((-118.60442 33.47855, -118.5987...
KY 21 01779786 0400000US21 21 KY Kentucky 00 102266598312 2384223544 MULTIPOLYGON (((-89.40565 36.52816, -89.39868 ...
AL 01 01779775 0400000US01 01 AL Alabama 00 131185049346 4582326383 MULTIPOLYGON (((-88.05338 30.50699, -88.05109 ...
In [11]:
us_gdf.set_index(['STUSPS'], drop=False, inplace=True)  # drop=False keeps STUSPS as a column; inplace=True modifies us_gdf directly
us_gdf
Out[11]:
STATEFP STATENS GEOIDFQ GEOID STUSPS NAME LSAD ALAND AWATER geometry
STUSPS
NM 35 00897535 0400000US35 35 NM New Mexico 00 314198587197 726463919 POLYGON ((-109.05017 31.48, -109.04984 31.4995...
SD 46 01785534 0400000US46 46 SD South Dakota 00 196341525171 3387709166 POLYGON ((-104.05788 44.9976, -104.05078 44.99...
CA 06 01779778 0400000US06 06 CA California 00 403673296401 20291770234 MULTIPOLYGON (((-118.60442 33.47855, -118.5987...
KY 21 01779786 0400000US21 21 KY Kentucky 00 102266598312 2384223544 MULTIPOLYGON (((-89.40565 36.52816, -89.39868 ...
AL 01 01779775 0400000US01 01 AL Alabama 00 131185049346 4582326383 MULTIPOLYGON (((-88.05338 30.50699, -88.05109 ...
GA 13 01705317 0400000US13 13 GA Georgia 00 149485311347 4419673221 MULTIPOLYGON (((-81.27939 31.30792, -81.27716 ...
AR 05 00068085 0400000US05 05 AR Arkansas 00 134660466558 3122251184 POLYGON ((-94.61792 36.49941, -94.61765 36.499...
PA 42 01779798 0400000US42 42 PA Pennsylvania 00 115881839569 3397855215 POLYGON ((-80.51989 40.90666, -80.51963 40.911...
MO 29 01779791 0400000US29 29 MO Missouri 00 178052260322 2487519141 POLYGON ((-95.77355 40.5782, -95.76853 40.5833...
CO 08 01779779 0400000US08 08 CO Colorado 00 268418756810 1185758065 POLYGON ((-109.06025 38.59933, -109.05954 38.7...
UT 49 01455989 0400000US49 49 UT Utah 00 213921849163 5963196691 POLYGON ((-114.05296 37.59278, -114.05247 37.6...
OK 40 01102857 0400000US40 40 OK Oklahoma 00 177664484361 3373395450 POLYGON ((-103.00256 36.52659, -103.00219 36.6...
TN 47 01325873 0400000US47 47 TN Tennessee 00 106792311478 2322248149 POLYGON ((-90.31045 35.0027, -90.30926 35.0095...
WY 56 01779807 0400000US56 56 WY Wyoming 00 251458162746 1868053273 POLYGON ((-111.05456 45.00096, -111.04507 45.0...
NY 36 01779796 0400000US36 36 NY New York 00 122049078273 19256833049 MULTIPOLYGON (((-72.0377 41.25128, -72.03472 4...
IN 18 00448508 0400000US18 18 IN Indiana 00 92786613552 1543998356 POLYGON ((-88.09776 37.90403, -88.09448 37.905...
KS 20 00481813 0400000US20 20 KS Kansas 00 211753777631 1345707708 POLYGON ((-102.05174 40.00308, -101.9167 40.00...
ID 16 01779783 0400000US16 16 ID Idaho 00 214049886849 2391614331 POLYGON ((-117.24268 44.39655, -117.23484 44.3...
AK 02 01785533 0400000US02 02 AK Alaska 00 1479016910296 245347100126 MULTIPOLYGON (((-131.61758 54.94795, -131.6107...
NV 32 01779793 0400000US32 32 NV Nevada 00 284537045712 1839880833 POLYGON ((-120.00645 39.27288, -120.00643 39.2...
IL 17 01779784 0400000US17 17 IL Illinois 00 143778366814 6216688589 POLYGON ((-91.51297 40.18106, -91.51107 40.188...
VT 50 01779802 0400000US50 50 VT Vermont 00 23872589127 1030648383 POLYGON ((-73.43774 44.04501, -73.43199 44.063...
MN 27 00662849 0400000US27 27 MN Minnesota 00 206244555303 18937471947 MULTIPOLYGON (((-89.59206 47.96668, -89.59147 ...
IA 19 01779785 0400000US19 19 IA Iowa 00 144659283871 1086402333 POLYGON ((-96.63836 42.7355, -96.63797 42.7363...
SC 45 01779799 0400000US45 45 SC South Carolina 00 77866000436 5074443652 MULTIPOLYGON (((-79.50795 33.02008, -79.50713 ...
NH 33 01779794 0400000US33 33 NH New Hampshire 00 23190126365 1025956733 MULTIPOLYGON (((-70.61702 42.97718, -70.61529 ...
DE 10 01779781 0400000US10 10 DE Delaware 00 5046703781 1399207462 MULTIPOLYGON (((-75.56752 39.5102, -75.56477 3...
DC 11 01702382 0400000US11 11 DC District of Columbia 00 158316184 18709787 POLYGON ((-77.11976 38.93434, -77.11253 38.940...
AS 60 01802701 0400000US60 60 AS American Samoa 00 197759067 1307243751 MULTIPOLYGON (((-168.14582 -14.54791, -168.145...
CT 09 01779780 0400000US09 09 CT Connecticut 00 12541750274 1816364426 MULTIPOLYGON (((-72.22593 41.29384, -72.22523 ...
MI 26 01779789 0400000US26 26 MI Michigan 00 146619947556 103866186527 MULTIPOLYGON (((-83.19159 42.03537, -83.18993 ...
MA 25 00606926 0400000US25 25 MA Massachusetts 00 20204345054 7130705555 MULTIPOLYGON (((-70.23405 41.28565, -70.22122 ...
FL 12 00294478 0400000US12 12 FL Florida 00 138963763779 45970528648 MULTIPOLYGON (((-80.17628 25.52505, -80.17395 ...
VI 78 01802710 0400000US78 78 VI United States Virgin Islands 00 348021909 1550236187 MULTIPOLYGON (((-64.62765 17.78857, -64.62727 ...
NJ 34 01779795 0400000US34 34 NJ New Jersey 00 19049183398 3532816229 MULTIPOLYGON (((-74.0422 40.69997, -74.039 40....
ND 38 01779797 0400000US38 38 ND North Dakota 00 178694342677 4414747947 POLYGON ((-104.04868 48.86378, -104.04865 48.8...
MD 24 01714934 0400000US24 24 MD Maryland 00 25151736098 6979330958 MULTIPOLYGON (((-76.04998 37.99011, -76.04865 ...
ME 23 01779787 0400000US23 23 ME Maine 00 79888284131 11745177995 MULTIPOLYGON (((-67.3226 44.6116, -67.32174 44...
HI 15 01779782 0400000US15 15 HI Hawaii 00 16634423916 11777375352 MULTIPOLYGON (((-156.06076 19.73055, -156.0566...
GU 66 01802705 0400000US66 66 GU Guam 00 543555846 934337453 MULTIPOLYGON (((144.64538 13.23627, 144.64764 ...
MP 69 01779809 0400000US69 69 MP Commonwealth of the Northern Mariana Islands 00 472292520 4644252458 MULTIPOLYGON (((145.6327 16.36647, 145.63295 1...
RI 44 01219835 0400000US44 44 RI Rhode Island 00 2677763372 1323686976 MULTIPOLYGON (((-71.28802 41.64558, -71.28647 ...
MT 30 00767982 0400000US30 30 MT Montana 00 376973185941 3867177570 POLYGON ((-116.04914 48.50205, -116.04913 48.5...
AZ 04 01779777 0400000US04 04 AZ Arizona 00 294366106734 854003932 POLYGON ((-114.81629 32.50804, -114.81432 32.5...
NE 31 01779792 0400000US31 31 NE Nebraska 00 198949602728 1379309601 POLYGON ((-104.05342 41.17054, -104.05321 41.1...
WA 53 01779804 0400000US53 53 WA Washington 00 172118778986 12548923076 MULTIPOLYGON (((-122.33164 48.02056, -122.3283...
PR 72 01779808 0400000US72 72 PR Puerto Rico 00 8869031577 4922247037 MULTIPOLYGON (((-65.23773 18.32118, -65.23612 ...
TX 48 01779801 0400000US48 48 TX Texas 00 676686238592 18982083586 MULTIPOLYGON (((-94.7183 29.72886, -94.71721 2...
OH 39 01085497 0400000US39 39 OH Ohio 00 105823831336 10274524796 MULTIPOLYGON (((-82.73447 41.60351, -82.72425 ...
WI 55 01779806 0400000US55 55 WI Wisconsin 00 140292627460 29343084365 MULTIPOLYGON (((-86.8309 45.42602, -86.82866 4...
OR 41 01155107 0400000US41 41 OR Oregon 00 248630419895 6168960338 MULTIPOLYGON (((-123.66475 46.24431, -123.6489...
MS 28 01779790 0400000US28 28 MS Mississippi 00 121533540877 3914738613 MULTIPOLYGON (((-88.50502 30.21574, -88.49164 ...
NC 37 01027616 0400000US37 37 NC North Carolina 00 125935880061 13453540851 MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ...
VA 51 01779803 0400000US51 51 VA Virginia 00 102258163252 8528087616 MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ...
WV 54 01779805 0400000US54 54 WV West Virginia 00 62266499712 489003081 POLYGON ((-82.6432 38.16909, -82.643 38.16956,...
LA 22 01629543 0400000US22 22 LA Louisiana 00 111930452904 23721187320 MULTIPOLYGON (((-88.8677 29.86155, -88.86566 2...
In [49]:
fig = plt.figure(figsize=(8, 6))
ax1 = plt.subplot(1, 1, 1)
plt.scatter(housing['longitude'], housing['latitude'], s=housing['population']/100,
            c=housing['median_house_value'], cmap='jet')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.colorbar()
plt.show()
In [34]:
# Now let us plot the US map:
fig = plt.figure(figsize=(30, 25))
ax1 = plt.subplot(1, 1, 1)
us_gdf.boundary.plot(ax=ax1, color="black")
cx.add_basemap(ax=ax1, crs=us_gdf.crs, attribution="", source=cx.providers.OpenTopoMap)
plt.axis(False)
plt.show()

Discover and Visualize the Data to Gain Insights

In [13]:
# We need to plot the CA map, so we extract the California row as follows:

ca_gdf = us_gdf.loc[["CA"]]
ca_gdf
Out[13]:
STATEFP STATENS GEOIDFQ GEOID STUSPS NAME LSAD ALAND AWATER geometry
STUSPS
CA 06 01779778 0400000US06 06 CA California 00 403673296401 20291770234 MULTIPOLYGON (((-118.60442 33.47855, -118.5987...
In [45]:
fig = plt.figure(figsize=(10, 8))
ax1 = plt.subplot(1, 1, 1)
ca_gdf.boundary.plot(ax=ax1, color="black")
cx.add_basemap(ax=ax1, crs=ca_gdf.crs, attribution="", source=cx.providers.OpenTopoMap)
plt.scatter(housing['longitude'],housing['latitude'], s=housing['population']/100, c=housing['median_house_value'], cmap='jet', label='population')
plt.legend()
plt.colorbar()
plt.axis(False)
plt.show()

Prepare the Data for Machine Learning Algorithms

Next, let's clean the data before feeding it to the learning algorithms (note that DataFrame.drop() creates a copy of the DataFrame without the dropped column; it doesn't actually modify the DataFrame itself, unless you pass inplace=True):
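A minimal sketch of that copy-versus-inplace behavior of drop(), on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# drop() returns a new DataFrame; the original keeps its column
dropped = df.drop(columns=["b"])
print("b" in df.columns, "b" in dropped.columns)  # True False

# With inplace=True the original DataFrame itself is modified
df.drop(columns=["b"], inplace=True)
print("b" in df.columns)  # False
```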

In [ ]:
# To clean the data, we can use the following strategies to process the missing values:
# 1. Drop the columns that include missing values (not a good idea, as it removes the whole column)
# 2. Drop the rows (we will use this approach here)
# 3. Fill the values with the mean or median (this strategy needs a valid justification)
In [12]:
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
In [14]:
housing.dropna(subset=["total_bedrooms"], axis=0)  # axis=0 drops the rows with missing values; dropna returns a new DataFrame, so the original data still has missing values
Out[14]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
... ... ... ... ... ... ... ... ... ... ...
20635 -121.09 39.48 25.0 1665.0 374.0 845.0 330.0 1.5603 78100.0 INLAND
20636 -121.21 39.49 18.0 697.0 150.0 356.0 114.0 2.5568 77100.0 INLAND
20637 -121.22 39.43 17.0 2254.0 485.0 1007.0 433.0 1.7000 92300.0 INLAND
20638 -121.32 39.43 18.0 1860.0 409.0 741.0 349.0 1.8672 84700.0 INLAND
20639 -121.24 39.37 16.0 2785.0 616.0 1387.0 530.0 2.3886 89400.0 INLAND

20433 rows × 10 columns

In [15]:
# If we want to drop the column with missing values, then we set axis=1
housing.drop(["total_bedrooms"], axis=1)
Out[15]:
longitude latitude housing_median_age total_rooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 565.0 259.0 3.8462 342200.0 NEAR BAY
... ... ... ... ... ... ... ... ... ...
20635 -121.09 39.48 25.0 1665.0 845.0 330.0 1.5603 78100.0 INLAND
20636 -121.21 39.49 18.0 697.0 356.0 114.0 2.5568 77100.0 INLAND
20637 -121.22 39.43 17.0 2254.0 1007.0 433.0 1.7000 92300.0 INLAND
20638 -121.32 39.43 18.0 1860.0 741.0 349.0 1.8672 84700.0 INLAND
20639 -121.24 39.37 16.0 2785.0 1387.0 530.0 2.3886 89400.0 INLAND

20640 rows × 9 columns

In [15]:
# Fill the values with the mean or median (the third strategy)
# Let us fill the null values with the average

avg = housing["total_bedrooms"].mean()
avg
Out[15]:
537.8705525375618
In [16]:
housing["total_bedrooms"].fillna(avg)
Out[16]:
total_bedrooms
0 129.0
1 1106.0
2 190.0
3 235.0
4 280.0
... ...
20635 374.0
20636 150.0
20637 485.0
20638 409.0
20639 616.0

20640 rows × 1 columns


In [17]:
#For our project we use the first strategy (dropping the rows with null values)
housing.dropna(subset=["total_bedrooms"], axis = 0, inplace=True) # inplace=True modifies the original DataFrame
housing.info()
<class 'pandas.core.frame.DataFrame'>
Index: 20433 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20433 non-null  float64
 1   latitude            20433 non-null  float64
 2   housing_median_age  20433 non-null  float64
 3   total_rooms         20433 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20433 non-null  float64
 6   households          20433 non-null  float64
 7   median_income       20433 non-null  float64
 8   median_house_value  20433 non-null  float64
 9   ocean_proximity     20433 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.7+ MB

An alternative would be to impute the numerical attributes with the "median" strategy (a median cannot be calculated on text attributes like ocean_proximity):
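As a sketch of that median-imputation strategy (hypothetical values; scikit-learn's SimpleImputer is one way to do it, though this notebook drops the rows instead):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical bedroom counts with one missing value.
df = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0, 235.0]})

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(df[["total_bedrooms"]])  # 2-D numpy array

median_used = imputer.statistics_[0]  # the median learned from the column
```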

Handling Text and Categorical Attributes

Now let's preprocess the categorical input feature, ocean_proximity, to prepare it for a machine learning model.

In [18]:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
In [26]:
housing["ocean_proximity"]
Out[26]:
ocean_proximity
0 NEAR BAY
1 NEAR BAY
2 NEAR BAY
3 NEAR BAY
4 NEAR BAY
... ...
20635 INLAND
20636 INLAND
20637 INLAND
20638 INLAND
20639 INLAND

20433 rows × 1 columns


In [ ]:
# As we can see, the ocean_proximity column holds strings, so we need to encode it as numbers using OrdinalEncoder or OneHotEncoder.
# OrdinalEncoder suits categories with a natural order; OneHotEncoder suits unordered ones. We try both below and keep the one-hot encoding, since these categories have no meaningful order.
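The difference can be sketched on a toy array (hypothetical two-category data): OrdinalEncoder assigns one integer per category, while OneHotEncoder produces one 0/1 column per category.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

cats = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"]])

# OrdinalEncoder: one integer per category (implies an order between them).
ordinal = OrdinalEncoder().fit_transform(cats)

# OneHotEncoder: one 0/1 column per category (no order implied).
onehot = OneHotEncoder().fit_transform(cats).toarray()
```

Categories are sorted alphabetically, so INLAND maps to 0 and NEAR BAY to 1.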
In [19]:
vc = housing["ocean_proximity"].value_counts()
vc
Out[19]:
count
ocean_proximity
<1H OCEAN 9034
INLAND 6496
NEAR OCEAN 2628
NEAR BAY 2270
ISLAND 5

In [20]:
#OrdinalEncoder
ordinal_encoder = OrdinalEncoder()   # Now let us ask it to learn from data
In [21]:
ordinal_encoder.fit(housing[["ocean_proximity"]])
Out[21]:
OrdinalEncoder()
In [22]:
#Let us see what is inside our encoder
ordinal_encoder.categories_
Out[22]:
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]
In [23]:
#Let us transform it; this maps the categories to integers 0 through 4
ordinal_encoder.transform(housing[["ocean_proximity"]])
Out[23]:
array([[3.],
       [3.],
       [3.],
       ...,
       [1.],
       [1.],
       [1.]])
In [26]:
#With OneHotEncoder we get five columns because there are five categories; each cell holds 0 or 1
#We use it here because our categories have no natural order
from sklearn.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder()
onehot_encoder.fit(housing[["ocean_proximity"]])
Out[26]:
OneHotEncoder()
In [28]:
X_P1= onehot_encoder.transform(housing[["ocean_proximity"]]).toarray()
X_P1
#By default, the `OneHotEncoder` class returns a sparse array, but we can convert it to a dense array by calling `toarray()`
#We use the dense array and save it
Out[28]:
array([[0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       ...,
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])
In [29]:
onehot_encoder.get_feature_names_out()
Out[29]:
array(['ocean_proximity_<1H OCEAN', 'ocean_proximity_INLAND',
       'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY',
       'ocean_proximity_NEAR OCEAN'], dtype=object)
In [30]:
# Create a DataFrame from the one-hot array
X_P1_df = pd.DataFrame(X_P1, columns=onehot_encoder.get_feature_names_out(), index=housing[["ocean_proximity"]].index)
X_P1_df
Out[30]:
ocean_proximity_<1H OCEAN ocean_proximity_INLAND ocean_proximity_ISLAND ocean_proximity_NEAR BAY ocean_proximity_NEAR OCEAN
0 0.0 0.0 0.0 1.0 0.0
1 0.0 0.0 0.0 1.0 0.0
2 0.0 0.0 0.0 1.0 0.0
3 0.0 0.0 0.0 1.0 0.0
4 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ...
20635 0.0 1.0 0.0 0.0 0.0
20636 0.0 1.0 0.0 0.0 0.0
20637 0.0 1.0 0.0 0.0 0.0
20638 0.0 1.0 0.0 0.0 0.0
20639 0.0 1.0 0.0 0.0 0.0

20433 rows × 5 columns

In [31]:
#Now let us combine our DataFrames into one
housing_df = pd.merge(left=housing, right=X_P1_df, left_index=True, right_index=True)
housing_df
Out[31]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity ocean_proximity_<1H OCEAN ocean_proximity_INLAND ocean_proximity_ISLAND ocean_proximity_NEAR BAY ocean_proximity_NEAR OCEAN
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 0.0 0.0 0.0 1.0 0.0
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 0.0 0.0 0.0 1.0 0.0
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 0.0 0.0 0.0 1.0 0.0
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY 0.0 0.0 0.0 1.0 0.0
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20635 -121.09 39.48 25.0 1665.0 374.0 845.0 330.0 1.5603 78100.0 INLAND 0.0 1.0 0.0 0.0 0.0
20636 -121.21 39.49 18.0 697.0 150.0 356.0 114.0 2.5568 77100.0 INLAND 0.0 1.0 0.0 0.0 0.0
20637 -121.22 39.43 17.0 2254.0 485.0 1007.0 433.0 1.7000 92300.0 INLAND 0.0 1.0 0.0 0.0 0.0
20638 -121.32 39.43 18.0 1860.0 409.0 741.0 349.0 1.8672 84700.0 INLAND 0.0 1.0 0.0 0.0 0.0
20639 -121.24 39.37 16.0 2785.0 616.0 1387.0 530.0 2.3886 89400.0 INLAND 0.0 1.0 0.0 0.0 0.0

20433 rows × 15 columns

Splitting train and test dataset

In [33]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(housing_df, test_size=0.2, stratify= housing_df["ocean_proximity"])
train_df
Out[33]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity ocean_proximity_<1H OCEAN ocean_proximity_INLAND ocean_proximity_ISLAND ocean_proximity_NEAR BAY ocean_proximity_NEAR OCEAN
5231 -121.36 37.99 8.0 1801.0 380.0 684.0 350.0 4.2589 134900.0 INLAND 0.0 1.0 0.0 0.0 0.0
20534 -122.33 38.38 28.0 1020.0 169.0 504.0 164.0 4.5694 287500.0 INLAND 0.0 1.0 0.0 0.0 0.0
14949 -117.14 32.80 41.0 2423.0 469.0 1813.0 466.0 2.1157 156900.0 NEAR OCEAN 0.0 0.0 0.0 0.0 1.0
7924 -122.26 38.31 33.0 4518.0 704.0 1776.0 669.0 5.2444 281100.0 NEAR BAY 0.0 0.0 0.0 1.0 0.0
11076 -123.21 39.18 17.0 2772.0 576.0 1501.0 584.0 2.6275 142100.0 <1H OCEAN 1.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20116 -123.13 40.85 18.0 1650.0 377.0 675.0 282.0 1.8933 84700.0 INLAND 0.0 1.0 0.0 0.0 0.0
11054 -117.94 34.15 33.0 859.0 144.0 421.0 138.0 4.4821 220100.0 INLAND 0.0 1.0 0.0 0.0 0.0
16299 -122.51 37.53 17.0 1574.0 262.0 672.0 241.0 7.2929 355800.0 NEAR OCEAN 0.0 0.0 0.0 0.0 1.0
7261 -118.26 34.04 6.0 1529.0 566.0 1051.0 473.0 2.4620 162500.0 <1H OCEAN 1.0 0.0 0.0 0.0 0.0
20235 -118.35 34.07 52.0 2497.0 406.0 1030.0 412.0 4.8900 500001.0 <1H OCEAN 1.0 0.0 0.0 0.0 0.0

16346 rows × 15 columns
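How stratify works can be sketched on a tiny hypothetical frame: the class proportions of the stratification column are preserved in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame: 8 "A" districts and 2 "B" districts.
df = pd.DataFrame({"x": range(10), "cat": ["A"] * 8 + ["B"] * 2})

tr, te = train_test_split(df, test_size=0.5, stratify=df["cat"], random_state=0)

# stratify keeps the A/B proportions identical in both splits.
train_share = (tr["cat"] == "A").mean()
test_share = (te["cat"] == "A").mean()
```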

In [ ]:

In [ ]:

Feature Scaling

In [ ]:
# Here I explain two scaling methods: MinMaxScaler and StandardScaler.
# Some rules:
# - Scale X (the features), not usually the target.
# - Learn (fit) the scaler from the train data only.
# - Apply transform to both the train and the test data.
In [ ]:
# min-max:  (x - min) / (max - min)  -> values in [0, 1]
# standard: (x - mean) / std_deviation -> roughly [-3, 3] for normally distributed data
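Both formulas can be checked on a toy column (hypothetical values):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])

mm = MinMaxScaler().fit_transform(x)    # (x - min) / (max - min) -> [0, 1]
ss = StandardScaler().fit_transform(x)  # (x - mean) / std -> mean 0, std 1
```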
In [35]:
# Let us specify our X_train data first; we exclude the one-hot columns created earlier and use just the first 8 numeric columns.
train_df.columns
Out[35]:
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity', 'ocean_proximity_<1H OCEAN',
       'ocean_proximity_INLAND', 'ocean_proximity_ISLAND',
       'ocean_proximity_NEAR BAY', 'ocean_proximity_NEAR OCEAN'],
      dtype='object')
In [41]:
X_P1_columns = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income']  # median_house_value will be our y, so part 1 covers the first 8 numeric columns
In [42]:
X_train_P1 = train_df[X_P1_columns]
X_train_P1
Out[42]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income
5231 -121.36 37.99 8.0 1801.0 380.0 684.0 350.0 4.2589
20534 -122.33 38.38 28.0 1020.0 169.0 504.0 164.0 4.5694
14949 -117.14 32.80 41.0 2423.0 469.0 1813.0 466.0 2.1157
7924 -122.26 38.31 33.0 4518.0 704.0 1776.0 669.0 5.2444
11076 -123.21 39.18 17.0 2772.0 576.0 1501.0 584.0 2.6275
... ... ... ... ... ... ... ... ...
20116 -123.13 40.85 18.0 1650.0 377.0 675.0 282.0 1.8933
11054 -117.94 34.15 33.0 859.0 144.0 421.0 138.0 4.4821
16299 -122.51 37.53 17.0 1574.0 262.0 672.0 241.0 7.2929
7261 -118.26 34.04 6.0 1529.0 566.0 1051.0 473.0 2.4620
20235 -118.35 34.07 52.0 2497.0 406.0 1030.0 412.0 4.8900

16346 rows × 8 columns

In [48]:
# Now let us specify the second part: the categorical (one-hot) columns
X_P2_columns = ['ocean_proximity_<1H OCEAN',
       'ocean_proximity_INLAND', 'ocean_proximity_ISLAND',
       'ocean_proximity_NEAR BAY', 'ocean_proximity_NEAR OCEAN']
In [49]:
X_train_P2 = train_df[X_P2_columns]
X_train_P2
Out[49]:
ocean_proximity_<1H OCEAN ocean_proximity_INLAND ocean_proximity_ISLAND ocean_proximity_NEAR BAY ocean_proximity_NEAR OCEAN
5231 0.0 1.0 0.0 0.0 0.0
20534 0.0 1.0 0.0 0.0 0.0
14949 0.0 0.0 0.0 0.0 1.0
7924 0.0 0.0 0.0 1.0 0.0
11076 1.0 0.0 0.0 0.0 0.0
... ... ... ... ... ...
20116 0.0 1.0 0.0 0.0 0.0
11054 0.0 1.0 0.0 0.0 0.0
16299 0.0 0.0 0.0 0.0 1.0
7261 1.0 0.0 0.0 0.0 0.0
20235 1.0 0.0 0.0 0.0 0.0

16346 rows × 5 columns

In [50]:
y_train = train_df["median_house_value"]
y_train
Out[50]:
median_house_value
5231 134900.0
20534 287500.0
14949 156900.0
7924 281100.0
11076 142100.0
... ...
20116 84700.0
11054 220100.0
16299 355800.0
7261 162500.0
20235 500001.0

16346 rows × 1 columns


In [51]:
X_test = test_df[X_P1_columns]
X_test
Out[51]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income
20385 -121.37 38.62 43.0 1077.0 199.0 447.0 182.0 3.0139
7320 -122.59 38.58 18.0 3753.0 752.0 1454.0 668.0 3.7585
3603 -122.34 37.95 39.0 1986.0 427.0 1041.0 385.0 3.2333
1222 -117.11 32.82 17.0 1787.0 330.0 1341.0 314.0 2.8750
11851 -122.02 38.02 44.0 1465.0 247.0 817.0 237.0 4.8693
... ... ... ... ... ... ... ... ...
17209 -119.04 35.35 31.0 1607.0 336.0 817.0 307.0 2.5644
20609 -121.56 39.11 18.0 2171.0 480.0 1527.0 447.0 2.3011
20124 -117.41 34.58 14.0 859.0 212.0 541.0 181.0 1.6838
7403 -122.06 37.32 30.0 3033.0 540.0 1440.0 507.0 6.2182
14985 -120.84 38.77 11.0 1013.0 188.0 410.0 158.0 4.8250

4087 rows × 8 columns

In [52]:
X_test_P2 = test_df[X_P2_columns]
X_test_P2
Out[52]:
ocean_proximity_<1H OCEAN ocean_proximity_INLAND ocean_proximity_ISLAND ocean_proximity_NEAR BAY ocean_proximity_NEAR OCEAN
20385 0.0 1.0 0.0 0.0 0.0
7320 1.0 0.0 0.0 0.0 0.0
3603 0.0 0.0 0.0 1.0 0.0
1222 0.0 0.0 0.0 0.0 1.0
11851 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ...
17209 0.0 1.0 0.0 0.0 0.0
20609 0.0 1.0 0.0 0.0 0.0
20124 0.0 1.0 0.0 0.0 0.0
7403 1.0 0.0 0.0 0.0 0.0
14985 0.0 1.0 0.0 0.0 0.0

4087 rows × 5 columns

In [53]:
y_test = test_df["median_house_value"]
y_test
Out[53]:
median_house_value
20385 115600.0
7320 185700.0
3603 135100.0
1222 112500.0
11851 156900.0
... ...
17209 73000.0
20609 57500.0
20124 57900.0
7403 380800.0
14985 184600.0

4087 rows × 1 columns


Select and Train a Model

In [ ]:

Training and Evaluating on the Training Set

In [57]:
#With the features and targets defined, let us scale the data before training models:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

minmax_scaler = MinMaxScaler()

minmax_scaler.fit(X_train_P1)
Out[57]:
MinMaxScaler()

In [58]:
X_train_P1_scaled = minmax_scaler.transform(X_train_P1)
In [60]:
minmax_scaler.transform(X_test)
Out[60]:
array([[0.29681275, 0.64612115, 0.82352941, ..., 0.02723592, 0.03378757,
        0.17337692],
       [0.1752988 , 0.64187035, 0.33333333, ..., 0.08900748, 0.12450999,
        0.22472793],
       [0.2001992 , 0.5749203 , 0.74509804, ..., 0.06367317, 0.07168191,
        0.18850774],
       ...,
       [0.69123506, 0.21679065, 0.25490196, ..., 0.03300209, 0.0336009 ,
        0.08164715],
       [0.22808765, 0.50797024, 0.56862745, ..., 0.08814869, 0.09445585,
        0.39436008],
       [0.34960159, 0.66206164, 0.19607843, ..., 0.02496626, 0.02930745,
        0.29827864]])
In [61]:
# As explained earlier, we will not use the min-max scaler; in this project we use the standard scaler
standard_scaler = StandardScaler()
standard_scaler.fit(X_train_P1)
Out[61]:
StandardScaler()
In [62]:
# Because we use this method, let us apply and save it:
X_train_P1_scaled = standard_scaler.transform(X_train_P1)
X_train_P1_scaled

#Printing this, we see most values fall between -3 and 3
Out[62]:
array([[-0.89283139,  1.10436393, -1.64143972, ..., -0.6794491 ,
        -0.39430576,  0.20362128],
       [-1.37589809,  1.2868202 , -0.05000376, ..., -0.84421548,
        -0.88461269,  0.36714234],
       [ 1.20875775, -1.32370791,  0.98442961, ...,  0.35400228,
        -0.08852294, -0.92506908],
       ...,
       [-1.46553933,  0.88915911, -0.92529354, ..., -0.69043352,
        -0.68163617,  1.80144067],
       [ 0.65099002, -0.74359055, -1.80058332, ..., -0.34350875,
        -0.07007053, -0.74269437],
       [ 0.6061694 , -0.72955545,  1.85971939, ..., -0.36273149,
        -0.23087011,  0.53598246]])
In [63]:
X_test_P1_scaled = standard_scaler.transform(X_test)
In [67]:
#Now we are ready to select a machine learning model and train it.
#But first we need to combine the feature parts: stack part 1 and part 2 together.

X_train = np.hstack([X_train_P1_scaled, X_train_P2.to_numpy()])
X_train.shape
Out[67]:
(16346, 13)
In [68]:
X_test = np.hstack([X_test_P1_scaled, X_test_P2.to_numpy()])
X_test.shape
Out[68]:
(4087, 13)
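np.hstack simply places the column blocks side by side; a toy sketch with hypothetical values:

```python
import numpy as np

# Scaled numeric part (2 columns) and one-hot part (2 columns).
scaled = np.array([[0.1, 0.2], [0.3, 0.4]])
onehot = np.array([[1.0, 0.0], [0.0, 1.0]])

combined = np.hstack([scaled, onehot])  # rows align, columns concatenate
```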
In [69]:
#Now we can create several models. We build regression models because the target, median house value, is a continuous number.
#First, linear regression:
from sklearn.linear_model import LinearRegression
l_m = LinearRegression(n_jobs=-1)
l_m.fit(X_train, y_train)
Out[69]:
LinearRegression(n_jobs=-1)
In [110]:
#After fitting the model, let us verify the results
y_test_lm_pred = l_m.predict(X_test)
y_test_lm_pred
Out[110]:
array([131852.8900752 , 231243.98675866, 218091.95544456, ...,
        45366.58560448, 333170.97072183, 149337.06983711])
In [108]:
test_df['linear'] = y_test_lm_pred/y_test-1
In [71]:
# We have predictions; how do we measure whether they are good or bad? One common metric is the mean squared error (MSE).
from sklearn.metrics import mean_squared_error
lm_rmsc = mean_squared_error(y_true=y_test, y_pred=y_test_lm_pred)
lm_rmsc
Out[71]:
4969943836.149107
In [109]:
#let us calculate the error ratio:
y_test_lm_pred/y_test-1 # ratio error: 0.14 means the prediction is 14% above the ground truth (negative = lower, positive = higher)
Out[109]:
median_house_value
20385 0.140596
7320 0.245256
3603 0.614300
1222 0.385082
11851 0.676582
... ...
17209 0.724250
20609 0.051140
20124 -0.216467
7403 -0.125076
14985 -0.191023

4087 rows × 1 columns


In [107]:
#Now let us take the absolute value and its average; this gives a clearer idea of our model's error:
np.average(np.abs(y_test_lm_pred/y_test-1))
Out[107]:
0.2947417310747705
In [ ]:
#So our linear model is off by about 29% on average (a mean absolute percentage error of ~0.29)

Interpreting the MSE value of 4969943836.149107 in the context of our housing price prediction model:

Interpretation of MSE:

Magnitude: The MSE value is quite large. This suggests that, on average, the squared difference between our model's predicted housing prices and the actual prices is substantial.

Units: MSE is expressed in squared units of the target variable; in this case, squared dollars (since we are predicting house values).

Desirability: A lower MSE is preferred, as it indicates better model accuracy. A value this high suggests the model's predictions can be significantly off from the actual values.

In conclusion: While the high MSE suggests the model is not performing optimally, it's not necessarily a dead end. By systematically exploring improvements in feature engineering, model selection, and hyperparameter tuning, we can likely achieve better predictive accuracy.
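One easy fix to the reporting itself is to take the square root of the MSE, giving the RMSE back in plain dollars (using the test MSE printed above):

```python
import numpy as np

mse = 4969943836.149107  # the linear model's test MSE from above
rmse = np.sqrt(mse)      # back in dollars: roughly $70,000
```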

In [72]:
#Let us build another model: a decision tree
from sklearn.tree import DecisionTreeRegressor
dt_m = DecisionTreeRegressor()
dt_m.fit(X_train, y_train)
Out[72]:
DecisionTreeRegressor()
In [73]:
y_test_treem_predected = dt_m.predict(X_test)

treem_rmsc = mean_squared_error(y_true=y_test, y_pred=y_test_treem_predected)
treem_rmsc
Out[73]:
4816919947.19036
In [112]:
np.average(np.abs(y_test_treem_predected/y_test-1))
Out[112]:
0.2374494908158694
In [99]:
test_df['Decission Tree'] = y_test_treem_predected/y_test-1
In [74]:
# we can also build a KNeighborsRegression
from sklearn.neighbors import KNeighborsRegressor
knn_m = KNeighborsRegressor()
knn_m.fit(X_train, y_train)
Out[74]:
KNeighborsRegressor()
In [100]:
#Let's make predictions with this model:
y_test_knn_predected = knn_m.predict(X_test)

knn_rmsc = mean_squared_error(y_true=y_test, y_pred=y_test_knn_predected)
knn_rmsc
Out[100]:
3813762723.492351
In [113]:
test_df['KNN'] = y_test_knn_predected/y_test-1
In [ ]:

In [87]:
np.average(np.abs(y_test_knn_predected/y_test-1))
Out[87]:
0.22756127026740677
In [76]:
#Let us build a Random Forest model:
from sklearn.ensemble import RandomForestRegressor
rf_m = RandomForestRegressor()
rf_m.fit(X_train, y_train)
Out[76]:
RandomForestRegressor()
In [ ]:

In [77]:
#Let us make predictions with this model:
y_test_rf_predected = rf_m.predict(X_test)

rf_rmsc = mean_squared_error(y_true=y_test, y_pred=y_test_rf_predected)
rf_rmsc
Out[77]:
2366839343.255092
In [ ]:

In [114]:
test_df['Random Forest'] = y_test_rf_predected/y_test-1
In [88]:
np.average(np.abs(y_test_rf_predected/y_test-1))
Out[88]:
0.17793944336971376
In [ ]:
# The RF model looks much better, with the smallest error among all the models we have tried so far.
In [91]:
#Our last model: SVR (support vector regression)
from sklearn.svm import SVR
svr_m = SVR()
svr_m.fit(X_train, y_train)
Out[91]:
SVR()
In [104]:
y_test_srvm_pred = svr_m.predict(X_test)
svrm_rmsc = mean_squared_error(y_true=y_test, y_pred=y_test_srvm_pred)
svrm_rmsc
Out[104]:
14104956534.654736
In [115]:
test_df['SVR'] = y_test_srvm_pred/y_test-1
In [93]:
np.average(np.abs(y_test_srvm_pred/y_test-1))
Out[93]:
0.5291344129712162

Better Evaluation Using Cross-Validation

This is another way to evaluate a model, though it takes longer because it trains several models (one per fold).

In [94]:
from sklearn.model_selection import cross_val_score
In [96]:
lm_C_rmsc = -cross_val_score(l_m, X_train, y_train, scoring="neg_mean_squared_error", cv=10) # the leading minus flips sklearn's neg_mean_squared_error scores back to positive MSE
lm_C_rmsc
Out[96]:
array([4.78899632e+09, 4.60315653e+09, 4.60036685e+09, 4.48785698e+09,
       5.06802965e+09, 4.95085257e+09, 4.46552747e+09, 4.66734787e+09,
       4.55737710e+09, 4.49044276e+09])
In [97]:
lm_C_rmsc.mean()
Out[97]:
4667995410.170768
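The sign convention can be verified on a tiny synthetic dataset (hypothetical, exactly linear, so every fold's MSE is near zero):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Exactly linear toy data: each fold's MSE should be essentially zero.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0

scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
fold_mse = -scores  # sklearn returns *negative* MSE; negate to recover MSE
```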

Visualizing the models

In [116]:
test_df
Out[116]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity ocean_proximity_<1H OCEAN ocean_proximity_INLAND ocean_proximity_ISLAND ocean_proximity_NEAR BAY ocean_proximity_NEAR OCEAN linear Decission Tree KNN Random Forest SVR
20385 -121.37 38.62 43.0 1077.0 199.0 447.0 182.0 3.0139 115600.0 INLAND 0.0 1.0 0.0 0.0 0.0 0.140596 136600.0 -0.034256 0.099057 0.550667
7320 -122.59 38.58 18.0 3753.0 752.0 1454.0 668.0 3.7585 185700.0 <1H OCEAN 1.0 0.0 0.0 0.0 0.0 0.245256 172600.0 0.147227 0.203452 -0.030884
3603 -122.34 37.95 39.0 1986.0 427.0 1041.0 385.0 3.2333 135100.0 NEAR BAY 0.0 0.0 0.0 1.0 0.0 0.614300 112700.0 0.321984 -0.062857 0.330790
1222 -117.11 32.82 17.0 1787.0 330.0 1341.0 314.0 2.8750 112500.0 NEAR OCEAN 0.0 0.0 0.0 0.0 1.0 0.385082 145800.0 0.139911 0.223342 0.597357
11851 -122.02 38.02 44.0 1465.0 247.0 817.0 237.0 4.8693 156900.0 NEAR BAY 0.0 0.0 0.0 1.0 0.0 0.676582 196200.0 0.434162 0.247903 0.148607
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
17209 -119.04 35.35 31.0 1607.0 336.0 817.0 307.0 2.5644 73000.0 INLAND 0.0 1.0 0.0 0.0 0.0 0.724250 82800.0 -0.218082 -0.019986 1.452465
20609 -121.56 39.11 18.0 2171.0 480.0 1527.0 447.0 2.3011 57500.0 INLAND 0.0 1.0 0.0 0.0 0.0 0.051140 54400.0 0.434087 0.369026 2.110714
20124 -117.41 34.58 14.0 859.0 212.0 541.0 181.0 1.6838 57900.0 INLAND 0.0 1.0 0.0 0.0 0.0 -0.216467 64700.0 1.153022 0.415250 2.092792
7403 -122.06 37.32 30.0 3033.0 540.0 1440.0 507.0 6.2182 380800.0 <1H OCEAN 1.0 0.0 0.0 0.0 0.0 -0.125076 344200.0 -0.145168 -0.046389 -0.525212
14985 -120.84 38.77 11.0 1013.0 188.0 410.0 158.0 4.8250 184600.0 INLAND 0.0 1.0 0.0 0.0 0.0 -0.191023 192300.0 -0.038678 -0.128131 -0.027720

4087 rows × 20 columns

In [120]:
#First we prepare our GeoDataFrame for visualization
test_housing_gdf = gpd.GeoDataFrame(test_df, geometry=gpd.points_from_xy(test_df["longitude"], test_df["latitude"]), crs=ca_gdf.crs)
In [147]:
# Let us copy our CA map and show our results on it
fig = plt.figure(figsize=(10,8))
ax1 = plt.subplot(1,1,1)
ca_gdf.boundary.plot(ax=ax1, color="none")
cx.add_basemap(ax=ax1, crs=us_gdf.crs, attribution="", source=cx.providers.OpenStreetMap.Mapnik)
test_housing_gdf.plot(ax=ax1, markersize=test_housing_gdf["population"]/200, column=test_housing_gdf["median_house_value"], cmap=plt.cm.jet, legend=True)
plt.axis(False)
plt.title("Median House Value in California Districts")
plt.savefig("test_housing_gdf.png", dpi=600)  # save the figure before plt.show()
plt.show()
In [144]:
#Let us explore our map further
test_housing_gdf.explore(column=test_housing_gdf["median_house_value"], cmap="jet", legend=True)
Output hidden; open in https://colab.research.google.com to view.

The final exploratory map effectively visualizes median house values across California districts, using a color gradient to represent the price range. It supports interactive exploration: users can zoom, pan, and hover over individual points to see details such as location and value. This makes it a powerful tool for understanding the geographical distribution of housing prices and identifying potential hotspots in the California housing market.


Conclusion

This project provided a comprehensive exploration of predicting median house values in California districts using machine learning. By applying a systematic approach encompassing data exploration, preparation, model training, evaluation, and visualization, we gained valuable insights into the housing market.

The Random Forest Regression model emerged as the most effective among the tested models, demonstrating promising predictive capabilities. The geographical visualization highlighted spatial patterns in housing prices, revealing potential influencing factors such as proximity to urban centers and coastal areas.

While the achieved results are encouraging, further research and model refinements could enhance predictive accuracy. This could involve hyperparameter tuning, feature engineering, or integrating external datasets to incorporate additional relevant information.

Overall, this project demonstrates the potential of machine learning for data-driven decision-making in the real estate domain. By leveraging the insights gained, stakeholders can make more informed choices regarding property valuations, investment strategies, and market analysis.

Mohammad Jawad Nayosh

In [ ]: